Phonemic or phonetic sub-word units are the most commonly used atomic elements for representing speech signals in modern ASR systems. However, they are not the optimal choice for several reasons: the large amount of effort required to handcraft a pronunciation dictionary, pronunciation variations, human errors, and under-resourced dialects and languages. Here, we propose a data-driven pronunciation estimation and acoustic modeling method that takes only the orthographic transcription to jointly estimate a set of sub-word units and a reliable dictionary. Experimental results show that the proposed method, which is based on semi-supervised training of a deep neural network, largely outperforms phoneme-based continuous speech recognition on the TIMIT dataset.